Thread vcov_type through StackedDiD (Phase 1b 2/8)#479
Conversation
Adds vcov_type ∈ {"classical","hc1","hc2","hc2_bm"} to StackedDiD with
clubSandwich R parity at atol=1e-10 on the hc2_bm path. Defaults to hc1
(prior behavior at machine precision).
Source edits:
- __init__ accepts vcov_type=, rejects classical/hc2 with cluster-
incompatibility ValueError (StackedDiD clusters intrinsically at unit
or unit_subexp; one-way families don't compose with cluster_ids per
the linalg validator), rejects conley with a deferral message.
- fit() rejects survey_design + vcov_type != "hc1" with
NotImplementedError (survey TSL/replicate-refit overrides analytical
sandwich; SA precedent). Reject order locked: fweight/aweight check
fires first, then vcov check (pinned by test_aweight_plus_hc2_bm_*).
- Main solve_ols call at :419: switched from bake-Q-into-X
(X_t = X*sqrt(Q)) to explicit `solve_ols(X, Y, weights=composed_weights,
vcov_type=self.vcov_type)`. solve_ols internally bakes Q for the coef
solve AND back-transforms for vcov on original-scale data via
clubSandwich's WLS-CR2 algebra for hc2_bm (PR #475).
- _refit_stacked closure at :444: mirror switch (cosmetic per
return_vcov=False but grep-consistency).
- StackedDiDResults gains vcov_type field; get_params() includes it.
- llms-full.txt agent-facing entry updated.
R parity:
- benchmarks/data/stacked_did_test_panel.csv (NEW) — pre-stacked panel
generated by extracting StackedDiD's internal stacked_data via the
fixed-seed Python fixture (n_units=50, n_periods=8, cohorts=[3,5,7],
seed=20260521; 325 rows after stacking).
- benchmarks/R/generate_stacked_did_golden.R (NEW) — loads CSV, fits
lm(weights=Q,...) with the same event-study design as Python, computes
CR1S (Stata-style) + CR2-BM SE + BM df_Satt at cluster=unit AND
cluster=unit_subexp via clubSandwich >= 0.7.0.
- benchmarks/data/stacked_did_golden.json (NEW) — committed goldens.
Tests:
- tests/test_stacked_did.py::TestStackedDiDVcovType (19 tests): default
+ bit-equality + reject paths + survey precedence + get_params/set_params
+ result-class field + clone idempotency + replicate-refit smoke +
reject-order regression + unit_subexp + survey symmetric.
- tests/test_methodology_stacked_did.py (4 tests, NEW): hc1 vs CR1S,
hc2_bm vs CR2 (unit), BM DOF vs coef_test()$df_Satt (via t-dist
inversion of CI half-width), hc2_bm vs CR2 (unit_subexp).
- Tolerance note: hc1 vs prior bake-Q-into-X is bit-equal up to ~2 ULPs
at SE scale (multiplication ordering in solve_ols(weights=) vs prior
user-side bake-w); pinned by test_hc1_se_bit_equal_to_pre_pr_baseline
at atol=1e-13.
Docs:
- REGISTRY.md StackedDiD section: new "Variance families" subsection
documenting hc1/hc2_bm routing + rejects, with **Note:** deferring
methodology framing to Phase 1a clubSandwich port (PR #475) — no new
methodology synthesis introduced in this PR.
- CHANGELOG.md: Unreleased Added bullet at top.
- TODO.md: Phase 1b row updated to track remaining 6 estimators; new
StackedDiD conley follow-up row mirroring SA precedent.
234 tests pass across stacked + methodology_stacked_did +
methodology_sun_abraham + methodology_twfe + methodology_wls_cr2 +
estimators_vcov_type.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… (R1 P0) Local codex R1 caught a P0: StackedDiD(vcov_type="hc2_bm") computed CR2 vcov correctly but never propagated the Bell-McCaffrey Satterthwaite DOF into safe_inference() calls. event_study_effects[h]['p_value']/['conf_int'] and overall_p_value/overall_conf_int silently fell back to normal-theory inference (df=None ⇒ scipy.norm), contradicting the registry contract. Fix mirrors the SunAbraham aggregated-inference pattern from PR #472 (sun_abraham.py:997-1097). After solve_ols(), if vcov_type=="hc2_bm" and not on the survey replicate-refit path, build contrast matrix: - Per-event-time: unit vector at each interaction_indices[h] - Overall ATT: 1/K average across post-period interaction columns Call _compute_cr2_bm_contrast_dof(X, cluster_ids, bread, contrasts, weights=composed_weights). Apply per-event-time DOFs to event_study_effects inference, overall DOF to overall_* inference. Wrap in try/except so any rank-deficient or linalg failure emits a UserWarning and falls back to normal-theory (visible deviation, not silent). R fixture extended with the post-period-average ATT contrast DOF via Wald_test(constraints=row_avg, vcov=CR2, test="HTZ")$df_denom (mirrors PR #465's MPD avg_att approach). New goldens at both cluster=unit and cluster=unit_subexp. Test additions / strengthening (addresses R1 P2 + P3): - test_hc2_bm_per_event_dof_matches_coef_test_df_satt_unit_cluster: uses brentq inversion of CI half-width to recover the DOF safe_inference actually used. If propagation failed (the R1 P0 bug), the inversion raises ValueError → test FAILS instead of silently skipping (replaces the prior `continue`-on-failure pattern which could vacuously pass). Hard-asserts validated_count == len(event_times). - test_hc2_bm_overall_att_dof_matches_wald_test_htz_unit_cluster (NEW): pins overall ATT DOF at atol=1e-6 against R Wald_test(HTZ)$df_denom. - test_hc2_bm_overall_att_dof_matches_wald_test_htz_unit_subexp_cluster (NEW): symmetric coverage at alternate cluster level. - Renamed CR1 → CR1S throughout docs/tests/REGISTRY for consistency (diff-diff's HC1+cluster uses Stata-style G/(G-1)*(n-1)/(n-p); plain CR1 omits the (n-1)/(n-p) term and diverges by ~1.4%). 192 tests pass (74 stacked + 19 wls_cr2 + 47 SA + 52 estimators_vcov_type). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Local codex R2 caught a P1: on rank-deficient stacked designs, the BM contrast DOF block called `_compute_cr2_bm_contrast_dof()` on the singular full bread matrix, which raised LinAlgError and triggered the catch-and- fallback path — downgrading the aggregated inference to normal-theory even when the target event-time delta_h coefficients were still identified. Fix mirrors the MultiPeriodDiD rank-deficient pattern (PR #465, estimators.py:1860-1913): - Derive `_identified = ~np.isnan(coef)` (solve_ols emits NaN for dropped columns under R-style rank handling). - Subset X, bread, and contrasts to the identified-column block BEFORE calling `_compute_cr2_bm_contrast_dof`. - Only build per-event-time contrasts for event-times whose target delta_h column is identified; only build the overall ATT contrast when ALL post-period delta_h are identified (otherwise the contrast is undefined). - The remaining try/except is a defensive guard for genuine singularities on the identified-column design (rare; the rank-deficient handling already removes the problematic columns). The earlier behavior was too aggressive: a single dropped nuisance column would knock out BM DOF for ALL contrasts. The new behavior NaN-guards only contrasts that target a dropped column. New regression test `test_hc2_bm_rank_deficient_design_keeps_bm_dof_*` — verifies the fit doesn't emit fallback warnings on a well-conditioned design + checks that CI half-widths use t(BM DOF) instead of z=1.96 (catches the R1/R2 normal-fallback failure mode end-to-end). 75 tests pass across stacked + methodology_stacked_did. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Local codex R3 caught a P1 on the same surface: when BM contrast DOF was unavailable (helper raised OR noise-floor NaN guard fired), StackedDiD either fell back to normal-theory inference (silent wrong CIs/p-values under the hc2_bm contract) or passed NaN df through safe_inference (which guards df<=0 but NOT NaN, producing finite t_stat with NaN p/CI — mixed inconsistent inference fields). Both fail-open patterns contradict the registry contract that hc2_bm aggregated inference uses BM DOF. Fix mirrors the LinearRegression.get_inference() pattern from PR #475 R7 (linalg.py:3689-3706): on the hc2_bm path, when the per-contrast BM DOF is None or non-finite, emit ALL-NaN inference fields for that contrast rather than falling back to safe_inference(df=None or NaN). Effect and SE remain finite — only the inference fields downstream of the DOF are suppressed. Applied at both inference sites: - Event-study per-event-time inference (event_study_effects[h]): inline check before safe_inference call. - Overall ATT inference (overall_t/p/conf_int): same guard. New regression tests: - test_hc2_bm_nan_dof_fails_closed_with_all_nan_inference: monkeypatches `_compute_cr2_bm_contrast_dof` to return a NaN-only vector and asserts ALL event-study + overall t_stat/p_value/conf_int are NaN while effect and se remain finite. - test_hc2_bm_helper_raises_fails_closed_with_all_nan_inference: monkeypatches the helper to raise LinAlgError and asserts the fallback path NaN-closes inference (does NOT use normal-theory). 77 tests pass. Pattern is consistent with the analytical-surface fail-closed contract added in PR #475 R7. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…drift lock) R4 verdict was ✅ with 3 polish items: P2: helper-exception warning text was stale (said "Falling back to normal distribution" but the code now NaN-closes per the R3 P1 fix). Rewrote the warning to accurately describe the fail-closed contract: "Aggregated p-values, t-statistics, and confidence intervals will be returned as NaN to preserve the hc2_bm contract". P3: set_params(vcov_type=...) bypassed the __init__ validation guard — sklearn-style mutation could set arbitrary values and fail later in the linalg layer with a different message. Fix: factored the vcov_type validation into `StackedDiD._validate_vcov_type(staticmethod)` and call it from BOTH __init__ AND set_params. New test `test_set_params_revalidates_vcov_type` exercises all 4 reject paths. P3: test_replicate_refit_coef_bit_equal_vs_bake_w_baseline was described as a "drift lock" but only checked finiteness — wouldn't catch numeric drift. Captured baseline overall_att=2.0807833161293945 + overall_se=0.14697256517699428 on the post-PR branch and asserted with atol=1e-13 (coef) / 1e-10 (SE, compounded through replicate-refit variance aggregation). Now the test actually catches drift. 78 tests pass (4 set_params reject paths + strengthened drift lock). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eficient test)
R5 verdict was ✅ with 2 P3 polish items:
P3 (Maintainability): the new solve_ols(weights=...) call sites relied on
the default weight_type="pweight". Locked the WLS-CR2 weight convention
locally by passing weight_type="pweight" explicitly in BOTH solve_ols
calls (main fit + _refit_stacked replicate closure). Prevents silent
variance-contract drift if solve_ols's default ever changes.
P3 (Docs/Tests): test_hc2_bm_rank_deficient_design_keeps_bm_dof_on_
identified_contrasts didn't actually create a rank-deficient design — it
only verified the t-distribution was used on a well-conditioned fixture.
Two-part fix:
1. Renamed the existing test to
test_hc2_bm_uses_t_distribution_not_normal_on_well_conditioned_design
(honest description: it's an end-to-end smoke check that BM DOF
threading reaches safe_inference).
2. NEW test test_hc2_bm_rank_deficient_keeps_bm_dof_on_identified_
contrasts: monkeypatches solve_ols to inject NaN at the intercept
column, simulating rank deficiency. Exercises the reduced-design
code path at stacked_did.py:529-577 added in the R2 P1 fix.
Verifies: (i) no fallback warning (reduced-design helper succeeds);
(ii) identified delta_h coefficients keep finite t-distribution
inference (CI half-width > z_0.975 * SE); (iii) overall ATT also
keeps t-distribution.
79 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R6 verdict was ✅ with 2 P3 test-coverage polish items: P3 (test): prior R fixture didn't pin the overall delta-method SE for the post-period-average ATT contrast — only BM DOF. Post-period covariance regressions could slip through. Fix: added `se_overall_cr1` and `se_overall_cr2` to the R fixture (computed as `c @ vcov @ c.T` for the post-period 1/K row contrast at both cluster levels). Added overall_se parity assertions at atol=1e-10 to the three existing hc1/hc2_bm parity tests. P3 (test): the Python parity test regenerated the panel via `generate_staggered_data(...)` instead of cross-checking against the committed CSV. Risk: future DGP-helper drift would silently invalidate R-parity assertions. Fix: added a DGP-drift check in the `panel` fixture that loads `benchmarks/data/stacked_did_test_panel.csv`, merges with the regenerated panel on (unit, period), and asserts the `outcome` column matches at atol=1e-12. Catches any drift in the DGP helper with a clear error message rather than wrong parity output. (first_treat column not checked because inf-vs-0 recoding differs between CSV/pandas-read and regenerated panel; outcome is what actually drives the R-side fit.) 79 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ning R7 verdict was⚠️ Needs changes with one P1 + one P3: P1: BM-DOF logic was only pinned on default (anticipation=0, weighting='aggregate'). The BM-DOF code path uses anticipation in post-period selection (`h >= -self.anticipation`) and consumes the composed Q-weights directly — both surfaces were untested. Added 4 parameter-interaction tests: - test_anticipation_plus_hc2_bm_threads_bm_dof: anticipation=1, asserts shifted reference period (e=-2 has SE=0), overall ATT contrast uses t(BM DOF) (CI half-width > z_0.975 * SE), at least one event-study CI uses t-distribution. - test_population_weighting_plus_hc2_bm_finite_threads_bm_dof: weighting='population' (different Q-weight formula), verifies finite output + BM DOF propagation via t-distribution check. - test_sample_share_weighting_plus_hc2_bm_finite_threads_bm_dof: weighting='sample_share' (third Q-weight formula), symmetric coverage. - test_hc1_vs_hc2_bm_differ_under_anticipation: sanity check that hc1 vs hc2_bm SEs actually differ under anticipation=1 (proves the leverage/DOF adjustment fires under non-default params). These tests don't require R goldens (no methodology change vs default case; just verifies the BM-DOF code path remains active under the parameter variants). R parity on the default path is already strong. P3: fixture-independence concern from R6 was only partially closed. Now also pins the `first_treat` column equality after normalizing inf/0 between CSV (pandas reads inf for never-treated) and regenerated (0 for never-treated in some pandas versions). Catches cohort-assignment drift independently of outcome drift. 83 tests pass (+4 new R7 tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ion') R8 verdict was⚠️ still Needs changes: R7's parameter-interaction tests were smoke checks (population test used constant pop=1.0 which is numerically degenerate with aggregate weighting on a balanced panel); no R goldens for non-default variants. Closes the gap with numerical parity at atol=1e-10. Source-tree additions: - benchmarks/data/stacked_did_anticipation1_panel.csv (NEW): pre-stacked fixture with anticipation=1, generated from same base panel via StackedDiD(anticipation=1).fit(). Event times [-3,-2,-1,0,1,2] with ref=-2 (shifted by anticipation). - benchmarks/data/stacked_did_population_panel.csv (NEW): pre-stacked fixture with weighting='population' AND heterogeneous unit populations (rng.uniform(1,100), seed=99). Q-weights range [0.68, 1.54] vs uniform ~1.0 under aggregate weighting; forces the population-formula branch. - benchmarks/R/generate_stacked_did_golden.R: refactored to loop over 3 variants (default, population, anticipation1), producing variant-keyed entries in `stacked_did_golden.json`. Post-period filter now generalized to `h >= -anticipation` (matches Python's `_post_event_times_preview` at stacked_did.py:546-548). - benchmarks/data/stacked_did_golden.json: regenerated with all 3 variants. Test additions (tests/test_methodology_stacked_did.py): - new class TestStackedDiDParityRVariants (4 tests): - test_population_weighting_hc1_matches_clubsandwich_cr1s - test_population_weighting_hc2_bm_matches_clubsandwich_cr2 - test_anticipation1_hc1_matches_clubsandwich_cr1s - test_anticipation1_hc2_bm_matches_clubsandwich_cr2 All assert event-study SE AND overall_se at atol=1e-10. - new fixture `population_panel` reproduces the heterogeneous-population seed (np.random.default_rng(99)) used by the R variant CSV generator. 87 tests pass (+4 variant parity). Note: weighting='sample_share' parity not added because its Q-weight formula structurally mirrors population weighting (both differ from aggregate only in the Q-weight calculation, which feeds into the same solve_ols(weights=) path). The population test exercises that code path; sample_share remains covered by the smoke test from R7. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R9 verdict was⚠️ still on parameter-coverage P1: sample_share had no R-parity fixture, and population/anticipation1 tests asserted SE only (not BM DOF). Source-tree additions: - benchmarks/data/stacked_did_sample_share_panel.csv (NEW): pre-stacked fixture for weighting='sample_share'. - benchmarks/R/generate_stacked_did_golden.R: added sample_share variant to the variants list; loops over all 4 variants now. - benchmarks/data/stacked_did_golden.json: regenerated with sample_share. Test additions (tests/test_methodology_stacked_did.py): - test_sample_share_weighting_hc1_matches_clubsandwich_cr1s - test_sample_share_weighting_hc2_bm_matches_clubsandwich_cr2 (+ DOF) - Extended test_population_weighting_hc2_bm_matches_clubsandwich_cr2: added per-event-time + overall BM DOF parity via CI-width inversion. - Extended test_anticipation1_hc2_bm_matches_clubsandwich_cr2: same DOF parity additions. 89 tests pass. All 3 non-default variants now have full R parity: event-study SE + overall_se + per-event BM DOF + overall BM DOF. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Overall assessment✅ Looks good Executive summary
MethodologyNo unmitigated findings. Cross-checking the change against Code QualityNo unmitigated findings. The new parameter is validated at construction and re-validation time ( Performance
Maintainability
Tech Debt
SecurityNo findings. Documentation/TestsNo unmitigated documentation/test findings in the PR itself. The added coverage is broad across Verification note: I could not run |
CI codex R1 verdict was ✅ with 2 polish items:
P2 (Maintainability): StackedDiDResults.summary() didn't surface the
analytical variance family (vcov_type), even though hc1 vs hc2_bm are
now materially different user-facing inference modes. Other result
classes (DiD/MPD/TWFE) already print a Variance: line via the shared
`_format_vcov_label` helper at results.py:49-89. Fix: added a Variance:
line to StackedDiDResults.summary() using the shared helper, suppressed
under survey_metadata (the survey block already surfaces inference
source). My earlier plan note ("mirror SA: don't add Variance: row")
was based on a wrong assumption — SA-results uses the same shared
helper, and StackedDiD inherits the same precedent.
P3 (Performance): on `vcov_type="hc2_bm"` fits with `aggregate=None`
or `"simple"`, the code built per-event Bell-McCaffrey contrast DOFs
that were never surfaced (only overall ATT is exposed). Avoidable
CR2 cost on every overall-only hc2_bm fit. Fix: gated the per-event
contrast construction (es_keys, es_cols_full) on
`aggregate == "event_study"`. The overall ATT contrast still gets
computed unconditionally (it's the user-facing scalar).
89 tests still pass — no behavioral regression on the event-study
path; overall-only fits are now slightly faster.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall assessment✅ Looks good Executive summary
Methodology
Code QualityNo unmitigated findings. PerformanceNo unmitigated findings. The earlier overall-only CR2-BM extra-work issue appears addressed by gating per-event DOF construction to the event-study path in MaintainabilityNo additional unmitigated findings beyond the summary-metadata issue above. Tech Debt
SecurityNo findings. Documentation/TestsNo additional unmitigated findings in the PR content. I could not run |
R2 verdict was ✅ + 1 P2: the summary's Variance: line (added in R1)
mislabelled clustered StackedDiD fits as one-way ("HC1
heteroskedasticity-robust" / "HC2 + Bell-McCaffrey DOF (one-way, n=N)")
because I passed cluster_name=None to _format_vcov_label. StackedDiD
is intrinsically clustered.
Fix:
- Added `cluster_name` and `n_clusters` fields to StackedDiDResults
dataclass.
- Populated both at fit() time: cluster_name=self.cluster, n_clusters=
np.unique(cluster_ids).size (matches the actual realized count, not
config).
- summary() now passes cluster_name + n_clusters into
_format_vcov_label, rendering "CR1 cluster-robust at unit, G=50"
for hc1 and "CR2 Bell-McCaffrey cluster-robust at unit, G=50" for
hc2_bm.
New regression test test_summary_renders_clustered_variance_label
verifies:
- hc1 → label contains "CR1 cluster-robust at unit"
- hc2_bm → label contains "CR2 Bell-McCaffrey cluster-robust at unit"
- Neither contains "(one-way" (catches the R2 regression directly)
- G= cluster count present
- cluster="unit_subexp" → label contains "at unit_subexp"
90 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
…ath (Phase 1b 3/8) WooldridgeDiD now accepts `vcov_type` for the OLS path, mirroring the SunAbraham PR igerber#472 / StackedDiD PR igerber#479 pattern: - `hc1` (default) preserves bit-equal within-transform CR1 behavior - `hc2_bm` / `hc2` / `classical` auto-route to full-dummy saturated design (FWL doesn't preserve the hat matrix; HC2 leverage + BM DOF need the full FE projection). Matches `clubSandwich::vcovCR(lm(...), type="CR2") + coef_test()$df_Satt` at atol=1e-10 on the 6 R-parity tests in tests/test_methodology_wooldridge.py. - Bell-McCaffrey Satterthwaite DOF threaded into overall ATT inference via `_compute_cr2_bm_contrast_dof`; fail-closed (all-NaN) when DOF unavailable, per feedback_bm_contrast_dof_fail_closed. - One-way `hc2`/`classical` auto-drop the unit auto-cluster (one-way families don't compose with cluster_ids). Explicit `cluster="X"` + one-way raises at the linalg validator. - `method ∈ {logit, poisson}` + `vcov_type != "hc1"` rejected at `__init__` (GLM CR2-BM derivation deferred to follow-up TODO row). - `SurveyDesign` + `vcov_type != "hc1"` rejected at `fit()` (survey TSL overrides analytical sandwich). - `n_bootstrap > 0` + one-way + `cluster=None` rejected at `fit()` (bootstrap is intrinsically clustered). WooldridgeDiDResults gains `vcov_type`, `cluster_name`, `n_clusters` fields for downstream introspection. Third PR of the Phase 1b standalone-estimator threading initiative (5 PRs remaining). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
vcov_type ∈ {"hc1","hc2_bm"}through StackedDiD (Phase 1b 2/8 of the standalone-estimator threading initiative). HC1 default preserves prior behavior; hc2_bm routes CR2 Bell-McCaffrey through the clubSandwich WLS-CR2 port merged in PR Lift hc2_bm + weights gates via clubSandwich WLS-CR2 port #475.classical/hc2rejected at__init__(StackedDiD is intrinsically clustered, nocluster=Noneopt-out).conleyrejected with a deferral message.event_study_effects[h]) and overall (overall_p_value/overall_conf_int) aggregated inference via_compute_cr2_bm_contrast_dof; mirrors the SunAbraham aggregated-inference pattern from PR Thread vcov_type through SunAbraham (Phase 1b 1/8) #472.estimators.py:1860-1913).LinearRegression.get_inference()from PR Lift hc2_bm + weights gates via clubSandwich WLS-CR2 port #475 R7).survey_design=combined withvcov_type != "hc1"raisesNotImplementedError. Reject order locked: fweight/aweight check fires first (Q-weight semantics), then vcov check.set_params(vcov_type=...)re-validates via shared_validate_vcov_typehelper (no bypass of__init__guard).atol=1e-10againstclubSandwich::vcovCR(...) + coef_test()$df_Satt+Wald_test(test="HTZ")$df_denomon 4 panel variants: default-aggregate, heterogeneous-population, sample_share, anticipation=1.Methodology references
hc2_bm + weightsrow); no new methodology choice is introduced in this PR.clubSandwich::vcovCR(type="CR1S")(Stata-styleG/(G-1) * (n-1)/(n-p)correction) — NOT plainCR1which omits the(n-1)/(n-p)factor and diverges by ~1.4% on the test fixture. Documented in REGISTRY.Validation
tests/test_methodology_stacked_did.py(NEW, 12 tests covering R parity on 4 variants × hc1/hc2_bm surfaces + DOF parity);tests/test_stacked_did.pyextended withTestStackedDiDVcovType(29 tests: defaults + rejects + clone idempotency + replicate-refit drift lock + rank-deficient handling + fail-closed inference + parameter-interaction smoke tests).benchmarks/R/generate_stacked_did_golden.R(clubSandwich v0.7.0). 4 committed pre-stacked CSV panels.Security / privacy
Test plan
pytest tests/test_methodology_stacked_did.pypytest tests/test_stacked_did.pypytest tests/test_methodology_sun_abraham.py tests/test_methodology_twfe.py tests/test_methodology_wls_cr2.pyfeedback_local_codex_vs_ci_codex_divergence)🤖 Generated with Claude Code